class: center, middle, inverse, title-slide # APSTA-GE 2003: Intermediate Quantitative Methods ## Lab Section 003, Week 10 ### New York University ### 11/10/2020 --- ## Reminders - Sample Assignment Solutions Available - In [this Google Drive folder](https://drive.google.com/drive/folders/1b3fHkxC81ov2Sw8uPomXpXHyar2w2eQu?usp=sharing) - No Assignment Due - <( ̄︶ ̄)> - Office hours - Monday 9:00 - 10:00am (EST) - Wednesday 12:30 - 1:30pm (EST) - Additional time slots available - Sign-up sheet [HERE](https://docs.google.com/spreadsheets/d/1YY38yj8uCNIm1E7jaI9TJC494Pye2-Blq9eSK_eh6tI/edit?usp=sharing) - Office hour Zoom link [HERE](https://nyu.zoom.us/j/9985119253) - Office hour notes - Available on NYU Classes --- ## Today's Topics - Analyzing NYC Property Sales Data using Multiple Regression Modeling --- ## Introduction In this lab, we will explore a property market dataset using multiple regression modeling. We will examine how the sale price varied across a 12-month period in 2016 - 2017. Specifically, we will try to understand the covariates that affect the sale price. Our research questions are: 1. How did the sale price of NYC properties vary in 2016 - 2017? 2. What contributed to the increase of the sale price? 3. How do we predict the sale price based on the columns we have? ## Design We will conduct an Exploratory Data Analysis (EDA), followed by a multiple regression analysis, to try to answer these questions. --- ## Dataset The dataset records every building or building unit (apartment, etc.) sold in the New York City property market over a 12-month period in 2016 - 2017. (Source: [Kaggle](https://www.kaggle.com/new-york-city/nyc-property-sales)) The dataset contains the location, address, type, sale price, and sale data of building units sold. [Dataset file here](https://drive.google.com/file/d/1JjV_UZW6eCeAuqyTY3yV5W3WZdZ9ve2A/view?usp=sharing) First, load the data into R: ```r dat <- read.csv("nyc-rolling-sales.csv", header = TRUE, * na.strings = c("NA", " - ")) dim(dat) ``` ``` ## [1] 84548 22 ``` Number of observations: 84,548 Number of columns: 22 Column Names: ```r colnames(dat) ``` ``` ## [1] "X" "BOROUGH" ## [3] "NEIGHBORHOOD" "BUILDING.CLASS.CATEGORY" ## [5] "TAX.CLASS.AT.PRESENT" "BLOCK" ## [7] "LOT" "EASE.MENT" ## [9] "BUILDING.CLASS.AT.PRESENT" "ADDRESS" ## [11] "APARTMENT.NUMBER" "ZIP.CODE" ## [13] "RESIDENTIAL.UNITS" "COMMERCIAL.UNITS" ## [15] "TOTAL.UNITS" "LAND.SQUARE.FEET" ## [17] "GROSS.SQUARE.FEET" "YEAR.BUILT" ## [19] "TAX.CLASS.AT.TIME.OF.SALE" "BUILDING.CLASS.AT.TIME.OF.SALE" ## [21] "SALE.PRICE" "SALE.DATE" ``` --- ## Codebook - Part 1 (Source: [Glossary of Terms for Property Sales Files - NYC](https://www1.nyc.gov/assets/finance/downloads/pdf/07pdf/glossary_rsf071607.pdf)) - X: integer: index - BOROUGH: integer: borough name - 1: Manhattan - 2: Bronx - 3: Brooklyn - 4: Queens - 5: Staten Island - NEIGHBORHOOD: character: neighborhood name - BUILDING.CLASS.CATEGORY: category: identify similar properties by broad usage (e.g. One Family Homes) without looking up individual Building Classes. - TAX.CLASS.AT.PRESENT: category: tax classes - Class 1: residential property of up to three units - Class 2: all other properties that are primarily residential - Class 3: property owned by a gas, telephone or electric company - Class 4: all other commercial properties - BLOCK: integer: sub-division of the borough - LOT: integer: a tax lot - EASE.MENT: logical: easement or not - BUILDING.CLASS.AT.PRESENT: category: constructive use - ADDRESS: character: address - APARTMENT.NUMBER: character: apt. number - ZIP.CODE: category: zip code - RESIDENTIAL.UNITS: integer: number of residential units - COMMERCIAL.UNITS: integer: number of commercial units - TOTAL.UNITs: integer: total number of units - LAND.SQUARE.FEET: character: land sqft - GROSS.SQUARE.FEET: character: gross sqft - YEAR.BUILT: integer: year built - BUILDING.CLASS.AT.TIME.OF.SALE: category: constructive use - SALE.PRICE: character: sale price - SALE.DATE: character: sale date --- ## Pre-process First, make column names to lowercase using the `stringr` package. Then, replace `.` with `_`. ```r *colnames(dat) <- str_to_lower(colnames(dat)) *colnames(dat) <- str_replace_all(colnames(dat), pattern = "[.]", replacement = "_") colnames(dat) ``` ``` ## [1] "x" "borough" ## [3] "neighborhood" "building_class_category" ## [5] "tax_class_at_present" "block" ## [7] "lot" "ease_ment" ## [9] "building_class_at_present" "address" ## [11] "apartment_number" "zip_code" ## [13] "residential_units" "commercial_units" ## [15] "total_units" "land_square_feet" ## [17] "gross_square_feet" "year_built" ## [19] "tax_class_at_time_of_sale" "building_class_at_time_of_sale" ## [21] "sale_price" "sale_date" ``` --- ## Pre-process - Examine if there is any missing value. ```r n_col <- ncol(dat) paste("Column Index", "Number of NAs", sep = ", ") ``` ``` ## [1] "Column Index, Number of NAs" ``` ```r for (index in 1:n_col) { print(paste(index, ", ", colnames(dat)[index], ", ", sum(is.na(dat[, index])), sep = "")) } ``` ``` ## [1] "1, x, 0" ## [1] "2, borough, 0" ## [1] "3, neighborhood, 0" ## [1] "4, building_class_category, 0" ## [1] "5, tax_class_at_present, 0" ## [1] "6, block, 0" ## [1] "7, lot, 0" ## [1] "8, ease_ment, 84548" ## [1] "9, building_class_at_present, 0" ## [1] "10, address, 0" ## [1] "11, apartment_number, 0" ## [1] "12, zip_code, 0" ## [1] "13, residential_units, 0" ## [1] "14, commercial_units, 0" ## [1] "15, total_units, 0" ## [1] "16, land_square_feet, 26252" ## [1] "17, gross_square_feet, 27612" ## [1] "18, year_built, 0" ## [1] "19, tax_class_at_time_of_sale, 0" ## [1] "20, building_class_at_time_of_sale, 0" ## [1] "21, sale_price, 14561" ## [1] "22, sale_date, 0" ``` --- ## Trim the dataset Drop unrelated columns. ```r col_name_to_drop <- c("x", "ease_ment", "neighborhood", "building_class_at_present", "tax_class_at_present", "address", "apartment_number", "tax_class_at_time_of_sale") col_index_to_drop <- which(colnames(dat) %in% col_name_to_drop) dat <- dat[, -col_index_to_drop] dim(dat) ``` ``` ## [1] 84548 14 ``` Drop rows with missing values. ```r col_name_w_NA <- c("sale_price", "land_square_feet", "gross_square_feet") col_index_w_NA <- which(colnames(dat) %in% col_name_w_NA) row_w_NA <- which(is.na(dat[, col_index_w_NA[1]])) dat <- dat[-row_w_NA, ] row_w_NA <- which(is.na(dat[, col_index_w_NA[2]])) dat <- dat[-row_w_NA, ] row_w_NA <- which(is.na(dat[, col_index_w_NA[3]])) dat <- dat[-row_w_NA, ] dim(dat) ``` ``` ## [1] 48244 14 ``` ```r table(is.na(dat)) ``` ``` ## ## FALSE ## 675416 ``` --- ## Descriptive Analysis ```r colnames(dat) ``` ``` ## [1] "borough" "building_class_category" ## [3] "block" "lot" ## [5] "zip_code" "residential_units" ## [7] "commercial_units" "total_units" ## [9] "land_square_feet" "gross_square_feet" ## [11] "year_built" "building_class_at_time_of_sale" ## [13] "sale_price" "sale_date" ``` ```r dat$sale_date <- as.Date(dat$sale_date) ``` --- ## Borough ```r p <- ggplot(dat, aes(x = factor(borough))) + geom_histogram(stat="count") + scale_x_discrete(labels = c("Man", "Bronx", "Brooklyn", "Queens", "SI")) + theme_minimal() ggplotly(p) ```
--- ## Average Sale Price Per Borough ```r average_sales_per_borough <- dat %>% group_by(borough) %>% summarize(total = n(), average_sales = mean(sale_price)) average_sales_per_borough ``` ``` ## # A tibble: 5 x 3 ## borough total average_sales ## <int> <int> <dbl> ## 1 1 1006 18517220. ## 2 2 7049 590194. ## 3 3 24047 834488. ## 4 4 11078 912305. ## 5 5 5064 528604. ``` --- ## Viz ```r p <- ggplot(average_sales_per_borough, aes(x = borough, y = average_sales)) + geom_point() + scale_x_discrete(labels = c("Man", "Bronx", "Brooklyn", "Queens", "SI")) + theme_minimal() ggplotly(p) ```
--- ## Building Class Category ```r building_class_cat <- dat %>% group_by(borough, building_class_category) %>% summarize(total = n()) %>% arrange(desc(total), borough) building_class_cat ``` ``` ## # A tibble: 146 x 3 ## # Groups: borough [5] ## borough building_class_category total ## <int> <chr> <int> ## 1 3 "02 TWO FAMILY DWELLINGS " 6040 ## 2 4 "01 ONE FAMILY DWELLINGS " 5927 ## 3 4 "02 TWO FAMILY DWELLINGS " 3683 ## 4 5 "01 ONE FAMILY DWELLINGS " 3529 ## 5 3 "01 ONE FAMILY DWELLINGS " 3497 ## 6 3 "13 CONDOS - ELEVATOR APARTMENTS " 2817 ## 7 3 "03 THREE FAMILY DWELLINGS " 2175 ## 8 3 "10 COOPS - ELEVATOR APARTMENTS " 2158 ## 9 2 "02 TWO FAMILY DWELLINGS " 1909 ## 10 3 "07 RENTALS - WALKUP APARTMENTS " 1605 ## # … with 136 more rows ``` --- ## Viz ```r p <- ggplot(building_class_cat, aes(x = building_class_category, y = total)) + geom_point() + theme_classic() ggplotly(p) ```
--- ## Zip code ```r zip_code <- dat %>% group_by(zip_code) %>% summarize(average_sales = mean(sale_price)) %>% arrange(desc(average_sales)) zip_code ``` ``` ## # A tibble: 180 x 2 ## zip_code average_sales ## <int> <dbl> ## 1 10167 2210000000 ## 2 10004 376053600 ## 3 10005 311668843 ## 4 10006 203345000 ## 5 10001 90572732. ## 6 10038 34163229. ## 7 10018 33632200 ## 8 10010 33538515. ## 9 10016 33370155. ## 10 10017 29581143. ## # … with 170 more rows ``` --- ## Viz ```r p <- ggplot(zip_code, aes(x = factor(zip_code), y = average_sales)) + geom_point() + theme_classic() ggplotly(p) ```
--- ## Other Columns ```r summary(dat$land_square_feet) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0 1413 2140 3358 3071 4228300 ``` ```r summary(dat$year_built) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0 1920 1931 1828 1961 2017 ``` ```r summary(dat$total_units) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.000 1.000 1.000 2.835 2.000 2261.000 ``` --- ## Check correlation ```r cor_dat <- cor(dat[, -c(2, 12, 14)]) *corrplot::corrplot(cor_dat, type = "lower") ``` <!-- --> --- ## Let's Build Our Model Together! ## Contact Tong Jin - Email: tj1061@nyu.edu - Office Hours - Mondays, 9 - 10am (EST) - Wednesdays, 12:30 - 1:30pm (EST)